-
Notifications
You must be signed in to change notification settings - Fork 414
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
RATIS-2084. Follower reply ALREADY_INSTALLED when install old snapshots from leader #1091
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@SzyWilliam , thanks for working on this! Just have a question inlined.
final long logIndex = conf.getLogEntryIndex(); | ||
final RaftConfiguration found = configurations.get(logIndex); | ||
if (found != null) { | ||
if (found != null && logIndex <= commitIndex) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why changing this? We should not allow two confs with the same index.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- B is elected as the new leader. The first thing B comes to power is to replicate a group configuration log L1 (old=[A,B,C], new=[A,B,C]) to establish its leader authority.
@SzyWilliam , I see the problem now. This is similar to RATIS-2011. We should truncate the confs when truncating the log.
@szetszwo, thanks for reviewing this! The original 2024-05-14 13:46:03,722 [grpc-default-executor-8] WARN server.GrpcServerProtocolService (LogUtils.java:warn(123)) - s4: Failed INSTALL_SNAPSHOT request cid=0, isHeartbeat? false
java.lang.IllegalStateException
at org.apache.ratis.util.Preconditions.assertTrue(Preconditions.java:35)
at org.apache.ratis.server.impl.ConfigurationManager.addConfiguration(ConfigurationManager.java:68)
at org.apache.ratis.server.impl.ServerState.setRaftConf(ServerState.java:382)
at org.apache.ratis.server.impl.ServerState.setRaftConf(ServerState.java:375)
at org.apache.ratis.server.impl.SnapshotInstallationHandler.installSnapshotImpl(SnapshotInstallationHandler.java:138)
at org.apache.ratis.server.impl.SnapshotInstallationHandler.installSnapshot(SnapshotInstallationHandler.java:97)
at org.apache.ratis.server.impl.RaftServerImpl.installSnapshot(RaftServerImpl.java:1652)
at org.apache.ratis.server.impl.RaftServerProxy.installSnapshot(RaftServerProxy.java:674)
at org.apache.ratis.grpc.server.GrpcServerProtocolService$2.process(GrpcServerProtocolService.java:341)
at org.apache.ratis.grpc.server.GrpcServerProtocolService$2.process(GrpcServerProtocolService.java:338)
at org.apache.ratis.grpc.server.GrpcServerProtocolService$ServerRequestStreamObserver.process(GrpcServerProtocolService.java:106)
at org.apache.ratis.grpc.server.GrpcServerProtocolService$ServerRequestStreamObserver.onNext(GrpcServerProtocolService.java:174)
at org.apache.ratis.thirdparty.io.grpc.stub.ServerCalls$StreamingServerCallHandler$StreamingServerCallListener.onMessage(ServerCalls.java:262)
at org.apache.ratis.thirdparty.io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.messagesAvailableInternal(ServerCallImpl.java:329)
at org.apache.ratis.thirdparty.io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.messagesAvailable(ServerCallImpl.java:314)
at org.apache.ratis.thirdparty.io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1MessagesAvailable.runInContext(ServerImpl.java:833)
at org.apache.ratis.thirdparty.io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
at org.apache.ratis.thirdparty.io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:133)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750) The event sequence that triggers the exception is as follows (a little bit tricky):
So back to the question, I think we should not allow two confs with the same committed index. For the uncommitted confs, we should force it to be aligned with the current leader (just as Raft Algo will truncates those conflicting local uncommitted logs). What do you think? |
@szetszwo Hi Tsz-Wo, this patch is merged and tested in IoTDB snapshot version and so far so good. I hope it could catch on board with 3.1.0, what do you think? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@SzyWilliam , thanks a lot for working on this! Adding logIndex <= commitIndex
to the if-condition may allow invalid confs in the map. How about fixing it by adding truncate?
diff --git a/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java b/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java
index 4512a2c223..8a85488135 100644
--- a/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java
+++ b/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java
@@ -1853,6 +1853,7 @@ class RaftServerImpl implements RaftServer.Division,
void notifyTruncatedLogEntry(LogEntryProto logEntry) {
if (logEntry.hasStateMachineLogEntry()) {
getTransactionManager().remove(TermIndex.valueOf(logEntry));
+ getState().truncate(logEntry.getIndex());
final ClientInvocationId invocationId = ClientInvocationId.valueOf(logEntry.getStateMachineLogEntry());
final CacheEntry cacheEntry = getRetryCache().getIfPresent(invocationId);
diff --git a/ratis-server/src/main/java/org/apache/ratis/server/impl/ServerState.java b/ratis-server/src/main/java/org/apache/ratis/server/impl/ServerState.java
index 0f46c6b523..c49e9554f0 100644
--- a/ratis-server/src/main/java/org/apache/ratis/server/impl/ServerState.java
+++ b/ratis-server/src/main/java/org/apache/ratis/server/impl/ServerState.java
@@ -387,6 +387,10 @@ class ServerState {
LOG.trace("{}: {}", getMemberId(), configurationManager);
}
+ void truncate(long logIndex) {
+ configurationManager.removeConfigurations(logIndex);
+ }
+
void updateConfiguration(List<LogEntryProto> entries) {
if (entries != null && !entries.isEmpty()) {
configurationManager.removeConfigurations(entries.get(0).getIndex());
final long logIndex = conf.getLogEntryIndex(); | ||
final RaftConfiguration found = configurations.get(logIndex); | ||
if (found != null) { | ||
if (found != null && logIndex <= commitIndex) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- B is elected as the new leader. The first thing B comes to power is to replicate a group configuration log L1 (old=[A,B,C], new=[A,B,C]) to establish its leader authority.
@SzyWilliam , I see the problem now. This is similar to RATIS-2011. We should truncate the confs when truncating the log.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 the change looks good. Just have a question inlined.
@@ -1853,6 +1853,7 @@ CompletableFuture<Message> applyLogToStateMachine(ReferenceCountedObject<LogEntr | |||
void notifyTruncatedLogEntry(LogEntryProto logEntry) { | |||
if (logEntry.hasStateMachineLogEntry()) { | |||
getTransactionManager().remove(TermIndex.valueOf(logEntry)); | |||
Optional.ofNullable(getState()).ifPresent(s -> s.truncate(logEntry.getIndex())); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How could getState()
return null? In a mock test?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops, it should be outside the if-statement, i.e.
+++ b/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java
@@ -1851,6 +1851,8 @@ class RaftServerImpl implements RaftServer.Division,
* @param logEntry the log entry being truncated
*/
void notifyTruncatedLogEntry(LogEntryProto logEntry) {
+ getState().truncate(logEntry.getIndex());
+
if (logEntry.hasStateMachineLogEntry()) {
getTransactionManager().remove(TermIndex.valueOf(logEntry));
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@szetszwo Thanks for the quick review! Indeed the getState()
will return null in the mock test TestSegmentedRaftLog.testAppendEntriesWithInconsistency
, which only creates a SegmentedRaftLog
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the info! The change looks good.
Thanks @szetszwo very much for the reviews and resolutions. |
What changes were proposed in this pull request?
What is the link to the Apache JIRA
https://issues.apache.org/jira/projects/RATIS/issues/RATIS-2084?filter=allopenissues
How was this patch tested?
New unit tests. Before the patch the IllegalStateException can be 100% reproduced in the unit test.